Abstract

Efficiently embedding virtual clusters in a physical network is a challenging
problem. In this paper we consider a scenario where the physical network has
the structure of a balanced tree. This assumption is justified by many
real-world implementations of datacenters.

We consider an extension to virtual cluster embedding by introducing
replication among data chunks. In many real-world applications, data
is stored in a distributed and redundant way. This assumption introduces
additional hardness in deciding which replica to process.

By reduction from the classical NP-complete problem of Boolean
Satisfiability, we show limits on the optimality of embedding. Our result
holds even in trees of edge height bounded by three. We also show that
limiting the replication factor to two replicas per chunk type does not make
the problem simpler.


1 Introduction 

Server virtualization has revamped the server business over the last years, and
has radically changed the way we think about resource allocation: today,
almost arbitrary computational resources can be allocated on demand. Moreover,
the virtualization trend has now started to spill over to the network:
batch-processing applications such as MapReduce often generate significant
network traffic (namely during the so-called shuffle phase) [18], and in order
to avoid interference in the underlying physical network and to provide a
predictable application performance, it is important to provide performance
isolation and bandwidth guarantees for the virtual network connecting the
virtual machines [10].

A prominent example of a large-scale framework used in datacenters is
MapReduce. In such applications, network usage often becomes the limiting
factor for application performance. Sharing a single link among multiple
node-node communications requires reserving a certain amount of bandwidth to
avoid slowing the transfers down. In order to model MapReduce execution,
bandwidth has to be reserved for the transfer of chunks to nodes and for the
node-node interconnection. The resulting virtual network to be embedded
consists of a clique and single links incoming to its vertices. This
abstraction is called a virtual cluster [3, 16].

Our Contributions. We show that minimizing the network footprint is NP-hard
in the presence of multiple replicas of the same chunk type. Moreover, we
show that NP-hard problems already arise in small-diameter networks (as they
are widely used today [1]), and even if the number of replicas is bounded by
two.


2 Model 

To get started, and before introducing our formal model and its constituting 
parts in detail, we will discuss the practical motivation. 

2.1 Background and Practical Motivation 

Our model is motivated by batch-processing applications such as MapReduce.
Such applications use multiple virtual machines to process data, which is
initially often stored redundantly in a distributed file system implemented by
multiple servers [6]. The standard datacenter topologies today are
(multi-rooted) fat-tree or Clos topologies [1, 8], hierarchical networks
recursively made of sub-trees at each level; servers are located at the tree
leaves. Given the amount of multiplexing over the mesh of links and the
availability of multi-path routing protocols, e.g. ECMP, the redundant links
can be considered as a single aggregate link for bandwidth reservations
[3, 16].

During execution, batch-processing applications typically cycle through
different phases, most prominently, a mapping phase and a reducing phase;
between the two phases, a shuffling operation is performed, a phase where the
results from the mappers are communicated to the reducers. Since the shuffling
phase can constitute a non-negligible part of the overall runtime [4], and since
concurrent network transmissions can introduce interference and performance
unpredictability [18], it is important to provide explicit minimal bandwidth
guarantees [10]. In particular, we model the virtual network connecting the
virtual machines as a virtual cluster [3, 10, 16]; however, we extend this
model with a notion of data-locality. Specifically, we distinguish between the
bandwidth needed between an assigned chunk and its virtual machine ($b_1$) and
the bandwidth needed between two virtual machines ($b_2$); in practice, for
applications with a large “mapping ratio” where the mapping phase already
reduces the data size significantly, it may hold that $b_2 < b_1$.

2.2 Fundamental Parts 

Let us now introduce our model more formally. It consists of three
fundamental parts: (1) the substrate network (the servers and the connecting
physical network), (2) the input to be processed (the data chunks), and (3) the
virtual network (the virtual machines and the logical network connecting the
machines to each other as well as to the chunks).

The Substrate Network. The substrate network (also known as the host
graph) represents the physical resources: a set $S$ of $n_S = |S|$ servers
interconnected by a network consisting of a set $R$ of routers (or switches)
and a set $E$ of (symmetric) links; we will often refer to the elements in
$S \cup R$ as the vertices. We will assume that the inter-connecting network
forms an (arbitrary, not necessarily balanced or regular) tree, where the
servers are located at the tree leaves. Each server $s \in S$ can host zero or
one virtual machine. Each link $e \in E$ has a certain bandwidth capacity
$\mathrm{cap}(e)$.

The Input Data. The data to be processed constitutes the input to the
batch-processing application. The data is stored in a distributed manner; this
spatial distribution is given and not subject to optimization. The input data
consists of $\tau$ different chunk types $\{c_1, \dots, c_\tau\}$, where each
chunk type $c_i$ can have $r_i \geq 1$ instances (or replicas), stored at
different servers.

It is sufficient to process one replica, and we will sometimes refer to this replica 
as the active (or selected) replica. 

The input data is stored redundantly, and the algorithm has the freedom to 
choose a replica for each chunk type, and assign it to a virtual machine (i.e., 
node). 

The Virtual Network. The virtual network consists of a set $V$ of
$n_V = |V|$ virtual machines, henceforth often simply called nodes. Each node
$v \in V$ can be placed (or, synonymously, embedded) on a server; this
placement can be subject to optimization.

Please note that the number of nodes might exceed the number of chunk types.
Excess (or idle) machines do not process chunks, but participate in the
shuffle and reduce phases of MapReduce. Idle machines have no matched chunks,
therefore their transportation cost is zero. Every machine, idle or not,
incurs communication cost to the other machines.

We will denote the server $s$ hosting node $v$ by $\pi(v) = s$. Since these
nodes process the input data, they need to be assigned and connected to the
chunks. Concretely, for each chunk type $c_i$, exactly one replica must be
processed by exactly one node $v$; which replica is chosen is subject to
optimization, and we will denote by $\mu$ the assignment of nodes to chunks.

In order to ensure a predictable application performance, both the
connection to the chunks as well as the interconnection between the nodes may
have to ensure certain minimal bandwidth guarantees; we will refer to the first
type of virtual network as the (chunk) access network, and to the second type of
virtual network as the (node) inter-connect; the latter is modeled as a complete
network (a clique). Concretely, we assume that an active chunk is connected to
its node at a minimal (guaranteed) bandwidth $b_1$, and a node is connected to
any other node at a minimal (guaranteed) bandwidth $b_2$.

We allow the number of chunk types to be smaller than the number of nodes.
The “idle” nodes however do participate in the inter-connect communication (in
practical terms: in the shuffle phase and the reducing phase).

2.3 Optimization Objective

Our goal is to develop algorithms which minimize the resource footprint: the 
guaranteed bandwidth allocation (or synonymously: reservation) on all links of 
the given embedding; note that only the resource allocation at the links but not 
at the servers depends on the replica selection or embedding. Thus, we on the 
one hand aim to embed the nodes in a locality-aware manner, close to the input 
data (the chunks), but at the same time also aim to embed the nodes as close 
as possible to each other. 

Formally, let $\mathrm{dist}(v, c)$ denote the distance (in the underlying
physical network $T$) between a node $v$ and its assigned (active) chunk
replica $c$, and let $\mathrm{dist}(v_1, v_2)$ denote the distance between the
two nodes $v_1$ and $v_2$. We define the footprint $F(v)$ of a node $v$ as
follows:

$$F(v) = \underbrace{\sum_{c \in \mu(v)} b_1 \cdot \mathrm{dist}(v, c)}_{\text{transportation}} + \underbrace{\sum_{v' \in V \setminus \{v\}} \frac{1}{2} \cdot b_2 \cdot \mathrm{dist}(v, v')}_{\text{inter-connect}},$$

where $\mu(v)$ is the set of chunks assigned to $v$. Our goal is to minimize
the overall footprint $F = \sum_{v \in V} F(v)$.
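The footprint can be evaluated directly on a small tree. The following is a minimal sketch, not part of the original model description: the adjacency-dictionary encoding of the tree and the names `tree_distances`, `footprint`, `placement` and `assignment` are assumptions made for illustration. Distances are hop counts, and the inter-connect term is summed once per unordered pair of nodes, which equals summing the factor 1/2 over ordered pairs.

```python
from collections import deque
from itertools import combinations


def tree_distances(adj, source):
    """Hop distances from `source` to every vertex, computed by BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def footprint(adj, placement, assignment, b1, b2):
    """Overall footprint F of an embedding.

    adj:        vertex -> list of neighbouring vertices (the substrate tree)
    placement:  node -> server hosting it
    assignment: node -> list of servers holding its active replicas
                (idle nodes may simply be absent)
    """
    dist = {s: tree_distances(adj, s) for s in set(placement.values())}
    # Transportation: bandwidth b1 on every hop between a node and its chunk.
    transport = sum(b1 * dist[placement[v]][c]
                    for v, chunks in assignment.items() for c in chunks)
    # Inter-connect: bandwidth b2 on every hop between each pair of nodes.
    interconnect = sum(b2 * dist[placement[u]][placement[v]]
                       for u, v in combinations(placement, 2))
    return transport + interconnect
```

Idle nodes contribute only to the inter-connect term, in line with the model above.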

2.4 Decision problem 

In order to perform a reduction from an NP-complete problem, we need to
transform our optimization problem into a decision problem. To do so, we define
EMB as the set containing the pair $(k, I)$ iff $I$ is an instance of the
virtual cluster embedding problem that has a feasible (bandwidth-respecting)
solution of cost $\le k$.

3 Hardness of the problem with multiple replicas allowed

We prove that EMB is NP-hard by reduction from the Boolean Satisfiability 
Problem (Sat). Since Sat is a decision problem, we introduce a cost threshold 
Th to transform EMB into a decision problem too. 

Let’s first recall that the Sat problem asks whether a positive (i.e.,
satisfying) valuation exists for a formula $\Psi$ with $\alpha$ clauses and
$\beta$ variables. In the following, we will only focus on Sat instances with
at least four variables; this Sat variant is still NP-hard.

Construction. Given any formula $\Psi$ in Conjunctive Normal Form (CNF)
with $\alpha$ clauses and $\beta \ge 4$ variables, we produce an EMB instance
as follows. First, we construct a substrate tree $T_\Psi$, consisting of a root
and separate gadgets for each variable of $\Psi$, each of which is a child of
the root. The gadget of variable $v$ consists of $\mathit{root}(v)$ and its two
children: $\mathit{positive}(v)$ and $\mathit{negative}(v)$. Child
$\mathit{positive}(v)$ has $\alpha$ children labeled
$v_1, v_2, \dots, v_\alpha$, and child $\mathit{negative}(v)$ has $\alpha$
children labeled $\neg v_1, \neg v_2, \dots, \neg v_\alpha$. Every gadget has
the same structure: the same height and the same number of leaves. This
construction is illustrated in Figure 1.

Figure 1: The construction of the variable gadget for $v$. The gadgets for the
other variables have the same structure. If $v$ appears in the first clause, a
chunk $c_1$ will be located at $v_1$. If $\neg v$ satisfies the last clause,
$\neg v_\alpha$ will host $c_\alpha$.

For all variables $v$ we set the bandwidth on the link connecting
$\mathit{root}(v)$ to the root of the substrate tree to
$\alpha \cdot (\alpha \cdot \beta - \alpha)$. The bandwidth on the other
edges is not limited.

We set the number of nodes to $n_V = \alpha \cdot \beta$. Moreover, we define
the inter-connect communication cost to be 1, and the access cost to be a
sufficiently large constant $W$, such that nodes must always be collocated with
chunks ($W = Th + 1$ is sufficient).

We set the number of chunk types to be equal to the number of clauses,
$\tau = \alpha$. To finish our construction, we place data chunks at leaves, as
follows: for the $i$-th clause we construct as many replicas of chunk $c_i$ as
there are literals in the clause. For each literal $\ell$ (of the form $v$ or
$\neg v$) that satisfies clause $i$, we place a replica of chunk $c_i$ in the
leaf labeled $\ell_i$.

Note that in this construction some nodes will be idle. No chunks will
be assigned to these nodes, but they will nevertheless participate in the node
interconnect.

We set the threshold $Th$ to: $Th = \beta \cdot \left(\binom{\alpha}{2} \cdot 2 + \alpha \cdot (\alpha \cdot \beta - \alpha)\right)$.
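The construction can be sketched in a few lines of code. This is an illustration under stated assumptions rather than the authors' implementation: clauses are encoded DIMACS-style as lists of nonzero integers (`+j` for variable `j`, `-j` for its negation), a leaf is written as a pair `(literal, i)` standing for the leaf of that literal with index `i`, and the function name `build_instance` is hypothetical. The threshold is deliberately left out of the sketch.

```python
def build_instance(clauses, num_vars):
    """Build the substrate tree, uplink capacities and chunk placement.

    clauses:  list of clauses, each a list of nonzero ints (assumed encoding:
              +j is variable j, -j is its negation)
    num_vars: number of variables (beta); the reduction assumes beta >= 4
    """
    alpha, beta = len(clauses), num_vars
    children = {"root": []}   # vertex -> list of children
    capacity = {}             # gadget root -> capacity of its uplink
    for j in range(1, beta + 1):
        gadget, pos, neg = f"root({j})", f"positive({j})", f"negative({j})"
        children["root"].append(gadget)
        children[gadget] = [pos, neg]
        # alpha leaves on each side, one per clause index
        children[pos] = [(j, i) for i in range(1, alpha + 1)]
        children[neg] = [(-j, i) for i in range(1, alpha + 1)]
        capacity[gadget] = alpha * (alpha * beta - alpha)
    # One chunk type per clause; one replica per literal of the clause,
    # stored at the leaf of that literal carrying the clause index.
    replicas = {i: [(lit, i) for lit in clause]
                for i, clause in enumerate(clauses, start=1)}
    num_nodes = alpha * beta
    return children, capacity, replicas, num_nodes
```

For example, `build_instance([[1, -2, 3], [2, 4]], 4)` yields four gadgets with four leaves each, and chunk type 1 has replicas at the leaves `(1, 1)`, `(-2, 1)` and `(3, 1)`.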

Proof of correctness of construction. We now show that our construction
indeed decides Sat. We set the capacities such that in every gadget, at
most $\alpha$ nodes can be mapped, where $\alpha$ is the number of clauses of
$\Psi$. We can apply the Bandwidth Lemma (Lemma 6) as follows: We interpret
$a_i$ as the number of nodes that are embedded in the $i$-th gadget, $\alpha$
as the number of clauses, and $\beta$ as the number of variables. The LHS of
the inequality of Lemma 6 is a formula for the communication cost of nodes
inside the $i$-th gadget to nodes outside the gadget. The RHS of the inequality
is the bandwidth constraint for the gadget. This implies that any feasible
solution must embed exactly $\alpha$ nodes in every gadget. Recall that in our
Sat instance, we have at least four variables.

Theorem 1. The problem EMB is NP-hard.

Proof. We will prove that the formula $\Psi$ is satisfiable iff EMB has a
solution of cost $\le Th$.

($\Rightarrow$) Let us take any valuation $\mathit{Val}$ that satisfies
$\Psi$. We will construct a solution to EMB using $\mathit{Val}$ in the
following way. For each variable $v$ in $\Psi$ we embed $\alpha$ nodes at the
leaves of the gadget of $v$. We need to choose $\alpha$ out of $2 \cdot \alpha$
leaves to embed nodes. If $\mathit{Val}(v) = 1$, we embed all nodes at the
leaves of $\mathit{positive}(v)$, else we embed all nodes at the leaves of
$\mathit{negative}(v)$. The solution constructed this way has cost exactly
$Th$, because the nodes are evenly split among gadgets, and nodes are not
distributed across the $\mathit{positive}(v)$ and $\mathit{negative}(v)$
subtrees.

We calculate the chunk-node matching $\mu$ by assigning every chunk to the
node which is collocated with the first replica of that chunk that has a
collocated node. This solution is feasible because every clause of $\Psi$ is
satisfied and chunks correspond to clauses.

Now we will show that this solution has cost Th. Due to the Bandwidth 
Lemma (Lemma 6), we only have to consider the communication cost. We sum 
inner-gadget communication and communication among gadgets to get exactly 
Th. 

($\Leftarrow$) Let us take any solution of cost $\le Th$ to the EMB instance
constructed from $\Psi$. We will construct a positive valuation $\mathit{Val}$
by considering the nodes in the solution to EMB.

We make the following observations. In every solution of cost $\le Th$,
every gadget has exactly $\alpha$ nodes at its leaves. This is due to the
Bandwidth Lemma (Lemma 6). Also, inside every gadget either all nodes are in
the $\mathit{positive}(v)$ subtree of variable $v$, or in the
$\mathit{negative}(v)$ subtree. This is true because the cost of a solution
where at least one gadget has nodes distributed across subtrees is always
greater than $Th$.

Now we can construct our valuation $\mathit{Val}$ as follows (for each
variable $v$ in $\Psi$): If $v_1$ hosts a node then $\mathit{Val}(v) = 1$,
otherwise $\mathit{Val}(v) = 0$.

The valuation $\mathit{Val}$ satisfies all clauses, and hence $\Psi$, as the
solution to EMB covers all chunks. To see this, consider the leaf which hosts a
node which is assigned to any given chunk (i.e., the leaf handling any given
clause chunk); it is a witness that the corresponding clause is satisfied.

□ 
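Both directions of the proof can be traced on a toy formula. The sketch below is illustrative only and rests on assumptions that are not in the original text: clauses are lists of nonzero integers (`+j` for variable `j`, `-j` for its negation), a leaf is the pair `(literal, i)`, and the three function names are hypothetical. It covers only node placement and valuation decoding, not the cost or bandwidth checks.

```python
def embed_from_valuation(valuation, num_clauses):
    """Forward direction: alpha nodes per gadget, all on the side chosen
    by the valuation. Returns the set of occupied leaves (literal, i)."""
    return {(j if value else -j, i)
            for j, value in valuation.items()
            for i in range(1, num_clauses + 1)}


def decode_valuation(occupied, num_vars):
    """Backward direction: Val(v) = 1 iff the leaf v_1 hosts a node."""
    return {j: (j, 1) in occupied for j in range(1, num_vars + 1)}


def clause_witnesses(occupied, clauses):
    """For each clause i, the leaves that store a replica of chunk c_i
    and also host a node; a non-empty list witnesses a satisfied clause."""
    return [[(lit, i) for lit in clause if (lit, i) in occupied]
            for i, clause in enumerate(clauses, start=1)]
```

Applied to the clauses `[[1, -2], [2, 3], [-1, 4]]` and a satisfying valuation, every clause obtains at least one witness leaf, and decoding the occupied leaves returns the valuation that was embedded.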

We conclude by observing that our construction leverages the fact that the
number of nodes may exceed the number of chunk types, e.g., for a clause
$(x \vee y \vee z)$ in $\Psi$ with index $i$, both $x$ and $y$ being true
implies the mapping of nodes on vertices labeled $x_i$ and $y_i$, which contain
the same chunk $c_i$.

3.1 Hardness of problem with two replicas of each type 

We can see that the proof from the previous section can also be carried out
from 3-Sat. This way we need only three replicas of each chunk type. 2-Sat,
however, is solvable in polynomial time, so the previous construction cannot be
carried over to two replicas of each type. In this section we show how to
modify the construction to show NP-hardness of the problem constrained to have
at most two replicas of each type.

Our results so far indicate that dealing with replication can be challenging.
However, all our hardness proofs concerned scenarios with three replicas, which
raises the question whether the problems are solvable in polynomial time with a
replication factor of 2. (Similarly to, say, the 2-Sat problem which is tractable
in contrast to 3-Sat.)

In the following, we show that this is not the case: the problem remains 
NP-hard, at least in the capacitated network. 

The proof is by reduction from 3-Sat. Given a formula $\Psi$ in conjunctive
normal form, consisting of $\alpha$ clauses and $\beta$ variables, we construct
a problem instance and substrate tree using two types of gadgets: gadgets for
variables and gadgets for clauses. Nota bene: unlike in the previous proof, we
will create three chunk types instead of just one for every clause.



Figure 2: Structure of clause gadgets. 

Construction. We build upon the construction for variable gadgets
introduced above (see also Figure 1).

1. Tree Construction: In addition to the variable gadgets known from the
previous construction, we introduce clause gadgets. The clause gadget
for a clause $C$ (illustrated in Figure 2) has two inner vertices,
$\mathit{root}(C)$ and $\mathit{middle}(C)$¹, and three leaves $C_1$, $C_2$
and $C_3$. We connect the leaves to the middle vertex, and the middle vertex
to $\mathit{root}(C)$. We attach the gadget to the tree by linking the global
root directly to $\mathit{root}(C)$. We construct our tree out of $\beta$
variable gadgets and $\alpha$ clause gadgets.

2. Chunk Distribution: For each clause $C$, we generate 3 chunk types with
2 replicas each. Each server in the clause gadget of $C$ holds a replica of a
different chunk type. The remaining replicas of the chunk types are placed in
the variable gadgets of the variables which satisfy the clause, similarly to
the previous proof. Thus, in total, $6 \cdot \alpha$ chunk replicas are
distributed in the substrate network. We will consider a setting where
$\alpha \cdot \beta + 2\alpha$ nodes need to be mapped. Our intention is that
in every variable gadget there will be $\alpha$ nodes, and in every clause
gadget there will be two nodes.

¹ The only purpose of the middle vertex is to maintain the balanced tree property.

3. Bandwidth Constraints: The available bandwidth of the top edge of the
gadget of each variable $v$ is set to
$\mathrm{cap}(v) = \alpha \cdot (\alpha \cdot (\beta - 1) + 2 \cdot \alpha)$.
This value results from $\alpha$ nodes in the gadget for $v$, each of which has
to communicate with $\alpha \cdot (\beta - 1)$ nodes in other variable gadgets
and $2 \cdot \alpha$ nodes in clause gadgets. The available bandwidth for the
top edge of each clause gadget $C$ is set to
$\mathrm{cap}(C) = 2 \cdot (\alpha \cdot \beta + 2 \cdot (\alpha - 1))$. This
value allows both of the 2 nodes in a clause gadget to communicate with the
$\alpha \cdot \beta$ nodes in variable gadgets and the other
$2 \cdot (\alpha - 1)$ nodes in clause gadgets.

4. Additional Properties: We set the threshold in a similar fashion as in
previous proofs. That is, the threshold depends on the intra-clause
communication cost (2 hops), and the inter-clause communication cost (6 hops).
We set the number of nodes to be placed to
$\alpha \cdot \beta + 2 \cdot \alpha$. We set the hosting capacity of each
server to 1, and set $b_1 = Th + 1$ to disallow remote chunk access. We set
$b_2 = 1$.
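The capacities in the bandwidth constraints can be checked against the intended node distribution with a few lines of arithmetic. The following sketch assumes $b_2 = 1$ and that every node communicates with every other node across the gadget's top edge; the function names are hypothetical and introduced only for illustration.

```python
def variable_uplink_capacity(alpha, beta):
    """cap(v) = alpha * (alpha * (beta - 1) + 2 * alpha)."""
    return alpha * (alpha * (beta - 1) + 2 * alpha)


def clause_uplink_capacity(alpha, beta):
    """cap(C) = 2 * (alpha * beta + 2 * (alpha - 1))."""
    return 2 * (alpha * beta + 2 * (alpha - 1))


def uplink_demand(nodes_inside, total_nodes):
    """Inter-connect bandwidth crossing a gadget's top edge for b2 = 1:
    every inside node talks to every outside node."""
    return nodes_inside * (total_nodes - nodes_inside)


alpha, beta = 3, 4                      # toy parameters
total = alpha * beta + 2 * alpha        # number of nodes to be placed
# The intended distribution uses the capacities exactly.
print(uplink_demand(alpha, total) == variable_uplink_capacity(alpha, beta))
print(uplink_demand(2, total) == clause_uplink_capacity(alpha, beta))
# A third node in a clause gadget would exceed the clause uplink.
print(uplink_demand(3, total) > clause_uplink_capacity(alpha, beta))
```

With $\alpha$ nodes in a variable gadget and two nodes in a clause gadget the demand meets the capacity with equality, which is why the capacities leave no slack for other distributions.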

Proof of correctness. We first prove the following helper lemma. 

Lemma 2. Every valid solution to EMB(2) with cost at most $Th$ has the
property that there are exactly $\alpha$ nodes in each of the $\beta$ variable
gadgets and exactly two nodes in each of the $\alpha$ clause gadgets.

Correctness follows from the extended bandwidth lemma (Lemma 7).

Theorem 3. EMB(2) is NP-hard.

Proof. We show that EMB(2) has a solution of cost $\le Th$ if and only if the
3-Sat formula $\Psi$ is satisfiable.

($\Rightarrow$) If we have a positive valuation of $\Psi$, we fill variable
gadgets with nodes as in the previous proof. Then we place 2 nodes in each of
the $\alpha$ clause gadgets as follows: Given a clause
$C = \ell_1 \vee \ell_2 \vee \ell_3$, we pick an arbitrary literal which
satisfies the clause. Subsequently we place nodes at the leaves of the clause
gadget which correspond to the other two literals. This strategy ensures that
all chunks can be assigned to collocated nodes, as the only chunk type which
cannot be assigned to a collocated node in the clause gadget has a node
collocated with its second replica in the variable gadget.

We will then assign chunks to nodes in the following way: For each chunk
type, we assign the replica in the variable gadgets to its collocated node. If
this node does not exist, we assign the replica in the clause gadget to its
collocated node.

Thus, we have produced a feasible solution of cost $Th$.

($\Leftarrow$) Let us take any solution $SOL$ to EMB(2) of cost $\le Th$.
Similar to the proof of Theorem 1 and Lemma 6, all nodes which are placed in a
variable gadget will be located in either the positive or the negative subtree.
Then we can compute a positive valuation by setting each variable $v$ as
follows:

$$\mathit{Val}(v) = \begin{cases} 1 & \text{if there is a node at the first leaf on the positive side of the gadget of } v \text{ in } SOL, \\ 0 & \text{otherwise.} \end{cases}$$

The theorem now follows from the following two additional lemmas. 


Lemma 4. For every clause there exists a node in a variable gadget that
processes one of the three chunks that correspond to that clause.

Proof. Each of the three chunks that correspond to a clause is assigned a
collocated node. At least one of those three nodes is located in a variable
gadget; otherwise, the two nodes in the clause gadget would not suffice to
process all three chunk types. □

Observe that it might happen that in $SOL$, two nodes in a clause gadget
are idle, and three nodes in variable gadgets are processing those 3 chunk types.
In this case, arbitrary nodes can be taken for the rest of the proof.

Lemma 5. $\mathit{Val}$ satisfies $\Psi$.

Proof. Let us consider the matching $M$ of $SOL$, and let us consider an
arbitrary clause of $\Psi$ as well as its three chunk types. Due to bandwidth
constraints, at most two of the chunk types can be processed by nodes in the
clause gadget. We identify any chunk type which is not assigned to a replica in
the clause gadget. The processed replica of that chunk type is located in a
variable gadget. Depending on whether the replica is located in the positive or
the negative subtree, we set the value of the corresponding variable to 1
(positive subtree) or 0 (negative subtree). □

□ 


4 The Bandwidth Lemmas 

Lemma 6 (Bandwidth Lemma). Let $\alpha$ and $\beta \ge 4$ be two arbitrary
positive integers. Let $a_1, a_2, \dots, a_\beta$ be a sequence of $\beta$
integers which adds up to $\alpha \cdot \beta$. Also, for each $i$ we have
$a_i \le 2 \cdot \alpha$. Then it holds that if

$$\forall i : a_i \cdot (\alpha \cdot \beta - a_i) \le \alpha \cdot (\alpha \cdot \beta - \alpha),$$

then for each $i$: $a_i = \alpha$.

Proof. By contradiction. Let us assume that there exists an index $k$ such that
$a_k \ne \alpha$. Then we can distinguish between two cases: either
$a_k < \alpha$ or $a_k > \alpha$.

Case $a_k < \alpha$: If there exists a $k$ with $a_k < \alpha$, then, since the
sequence adds up to $\alpha \cdot \beta$, there must also exist a $k'$ such
that $a_{k'} > \alpha$ (by a simple pigeonhole principle). Thus, this case
reduces to the second case (Case $a_k > \alpha$) proved next.

Case $a_k > \alpha$: Since it also holds that $a_k \le 2\alpha$, $a_k$ must be
of the form $\alpha + x$ for $x \in \{1, \dots, \alpha\}$. Let us consider the
(bandwidth) inequality:

$$(\alpha + x) \cdot (\alpha \cdot \beta - \alpha - x) \le \alpha \cdot (\alpha \cdot \beta - \alpha)$$

Expanding the left-hand side and cancelling
$\alpha \cdot (\alpha \cdot \beta - \alpha)$ on both sides, this can be
transformed to:

$$0 \le x \cdot (x - \alpha \cdot (\beta - 2))$$

The inequality holds for $x \le 0$ or $x \ge \alpha \cdot (\beta - 2)$, and no
positive $x \le \alpha$ can satisfy it for $\beta \ge 4$, since then
$\alpha \cdot (\beta - 2) \ge 2\alpha > \alpha$. Contradiction. □
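The lemma can be sanity-checked by exhaustive search for small parameters. The sketch below is not part of the original proof; the function name `violations` is hypothetical, and the sequence entries are assumed to be nonnegative (they count nodes). It enumerates all sequences that meet the sum, the upper bound and the bandwidth inequality, and reports those that differ from $(\alpha, \dots, \alpha)$.

```python
from itertools import product


def violations(alpha, beta):
    """Sequences a_1..a_beta with 0 <= a_i <= 2*alpha and sum alpha*beta
    that satisfy a_i * (alpha*beta - a_i) <= alpha * (alpha*beta - alpha)
    for all i, yet are not the all-alpha sequence."""
    total = alpha * beta
    bound = alpha * (total - alpha)
    bad = []
    for seq in product(range(2 * alpha + 1), repeat=beta):
        if sum(seq) != total:
            continue
        within_bandwidth = all(a * (total - a) <= bound for a in seq)
        if within_bandwidth and any(a != alpha for a in seq):
            bad.append(seq)
    return bad


# No counterexample for beta >= 4 in these small cases.
print(violations(2, 4), violations(3, 4), violations(2, 5))
# With only three variables the statement breaks, e.g. (4, 2, 0) for alpha = 2.
print((4, 2, 0) in violations(2, 3))
```

The second check illustrates why the reduction restricts itself to Sat instances with at least four variables: for $\beta = 3$ the value $x = \alpha$ satisfies $x \ge \alpha \cdot (\beta - 2)$, so an overfull gadget can stay within its bandwidth bound.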

Lemma 7 (Extended Bandwidth Lemma). Let $\alpha$ and $\beta \ge 4$ be two
arbitrary positive integers. Let $a_1, a_2, \dots, a_\alpha$ and
$b_1, b_2, \dots, b_\beta$ be two sequences of integers (numbers of nodes in
clause and in variable gadgets, respectively). The sum of all elements in $a$
and $b$ adds up to $\alpha \cdot \beta + \alpha \cdot 2$ (the number of nodes).
Also we have $b_i \le 2 \cdot \alpha$ (variable gadget node hosting capacity,
equal to the number of leaves), and $a_i \le 3$ (clause gadget node hosting
capacity). If the uplink of each variable gadget does not exceed its bandwidth
constraint,

$$\forall i \le \beta : b_i \cdot (\alpha \cdot \beta + 2 \cdot \alpha - b_i) \le \alpha \cdot (\alpha \cdot \beta - \alpha + 2 \cdot \alpha),$$

and the uplink of each clause gadget does not exceed its bandwidth constraint,

$$\forall i \le \alpha : a_i \cdot (\alpha \cdot \beta + 2 \cdot \alpha - a_i) \le 2 \cdot (\alpha \cdot \beta + 2 \cdot \alpha - 2),$$

then for each $i \le \beta$: $b_i = \alpha$, and for each $i \le \alpha$:
$a_i = 2$ (we have the expected number of nodes in variable and clause
gadgets).

We can prove the extended bandwidth lemma by the pigeonhole principle.
However, an easier way exists. We sum the available bandwidth on all uplinks of
clause gadgets to $C$ and the bandwidth on uplinks of variable gadgets to $V$.
The only way that we can distribute nodes between clause and variable gadgets
is to have $2 \cdot \alpha$ in total in clause gadgets and $\alpha \cdot \beta$
in variable gadgets. To conclude, we apply the Bandwidth Lemma (Lemma 6) to
clause gadgets and separately to variable gadgets.

5 Related Work 

There has recently been much interest in programming models and distributed
system architectures for the processing and analysis of big data (e.g. [2, 6, 17]).
The model studied in this paper is motivated by MapReduce-like [6]
batch-processing applications, also known from the popular open-source
implementation Apache Hadoop. These applications generate large amounts of
network traffic [4, 10, 18], and over the last years, several systems have been
proposed which provide a provable network performance, also in shared cloud
environments, by supporting relative [11, 12, 15] or, as in the case of our
paper, absolute [3, 9, 13, 14, 16] bandwidth reservations between the virtual
machines.

The most popular virtual network abstraction for batch-processing
applications today is the virtual cluster, introduced in the Oktopus paper [3],
and later studied by many others [10, 16]. Several heuristics have been
developed to compute “good” embeddings of virtual clusters: embeddings with
small footprints (minimal bandwidth reservation costs) [3, 10, 16]. The virtual
network embedding problem has also been studied for more general graph
abstractions (e.g., motivated by wide-area networks) [5, 7].

6 Summary and Conclusion

We have shown that several embedding problems are NP-hard already in
three-level trees (a practically relevant result given today’s datacenter
topologies [1]), and even if the number of replicas is bounded by two.